Analysis of Traffic Accidents in the USA¶

Introduction¶


In this report, I analyse traffic-related fatalities across the states of the USA. A preliminary examination of demographic information on traffic-accident victims in each US state made it evident that there are significant disparities among the states. This sets forth the primary objective of the study: to identify patterns in this disparity and develop a targeted action plan, rather than implementing a costly nationwide initiative. The data used in this report were originally collected by the National Highway Traffic Safety Administration and the National Association of Insurance Commissioners; the dataset was compiled by FiveThirtyEight.

Preliminary analysis shows that the number of traffic-related fatalities has risen persistently and reached an alarming ten-year high, fueled by the growing miles driven on the nation's roads, despite a decline in the rate of fatal road accidents since the 1980s. The report employs a combination of data-analysis techniques, including data wrangling, visualization, dimensionality reduction, and unsupervised clustering, to explore the reasons for these patterns and to recommend ways to mitigate the fatalities.

In [1]:
# inspecting the dataset
accidents_head = !head -n 10 bad-drivers.csv
accidents_head
Out[1]:
['State,Number of drivers involved in fatal collisions per billion miles,Percentage Of Drivers Involved In Fatal Collisions Who Were Speeding,Percentage Of Drivers Involved In Fatal Collisions Who Were Alcohol-Impaired,Percentage Of Drivers Involved In Fatal Collisions Who Were Not Distracted,Percentage Of Drivers Involved In Fatal Collisions Who Had Not Been Involved In Any Previous Accidents,Car Insurance Premiums ($),Losses incurred by insurance companies for collisions per insured driver ($)',
 'Alabama,18.8,39,30,96,80,784.55,145.08',
 'Alaska,18.1,41,25,90,94,1053.48,133.93',
 'Arizona,18.6,35,28,84,96,899.47,110.35',
 'Arkansas,22.4,18,26,94,95,827.34,142.39',
 'California,12,35,28,91,89,878.41,165.63',
 'Colorado,13.6,37,28,79,95,835.5,139.91',
 'Connecticut,10.8,46,36,87,82,1068.73,167.02',
 'Delaware,16.2,38,30,87,99,1137.87,151.48',
 'District of Columbia,5.9,34,27,100,100,1273.89,136.05']

Preliminary overview of the dataset¶

In this section, the report provides a concise overview to familiarize the reader with the data under consideration.

In [2]:
# Import the `pandas` module as "pd"
import pandas as pd

# Read in `road-accidents.csv`
car_acc = pd.read_csv('bad-drivers.csv', comment='#')

# Save the number of rows and columns as a tuple
rows_and_cols = car_acc.shape
print('There are {} rows and {} columns.\n'.format(
    rows_and_cols[0], rows_and_cols[1]))

# Generate an overview of the DataFrame (`info` prints directly and returns None)
car_acc.info()

car_acc = car_acc.rename(columns={
    'State': 'state',
    'Number of drivers involved in fatal collisions per billion miles': 'Fatality_rate_Drivers',
    'Percentage Of Drivers Involved In Fatal Collisions Who Were Speeding': 'Speeders',
    'Percentage Of Drivers Involved In Fatal Collisions Who Were Alcohol-Impaired': 'Alcohol-Impaired',
    'Percentage Of Drivers Involved In Fatal Collisions Who Were Not Distracted': 'Not_Distracted',
    'Percentage Of Drivers Involved In Fatal Collisions Who Had Not Been Involved In Any Previous Accidents': 'No_Previous_Accidents',
    'Car Insurance Premiums ($)': 'Premiums',
    'Losses incurred by insurance companies for collisions per insured driver ($)': 'Losses_Per_Driver'
})

# Display the last five rows of the DataFrame
car_acc.tail()
There are 51 rows and 8 columns.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 8 columns):
 #   Column                                                                                                  Non-Null Count  Dtype  
---  ------                                                                                                  --------------  -----  
 0   State                                                                                                   51 non-null     object 
 1   Number of drivers involved in fatal collisions per billion miles                                        51 non-null     float64
 2   Percentage Of Drivers Involved In Fatal Collisions Who Were Speeding                                    51 non-null     int64  
 3   Percentage Of Drivers Involved In Fatal Collisions Who Were Alcohol-Impaired                            51 non-null     int64  
 4   Percentage Of Drivers Involved In Fatal Collisions Who Were Not Distracted                              51 non-null     int64  
 5   Percentage Of Drivers Involved In Fatal Collisions Who Had Not Been Involved In Any Previous Accidents  51 non-null     int64  
 6   Car Insurance Premiums ($)                                                                              51 non-null     float64
 7   Losses incurred by insurance companies for collisions per insured driver ($)                            51 non-null     float64
dtypes: float64(3), int64(4), object(1)
memory usage: 3.3+ KB
Out[2]:
state Fatality_rate_Drivers Speeders Alcohol-Impaired Not_Distracted No_Previous_Accidents Premiums Losses_Per_Driver
46 Virginia 12.7 19 27 87 88 768.95 153.72
47 Washington 10.6 42 33 82 86 890.03 111.62
48 West Virginia 23.8 34 28 97 87 992.61 152.56
49 Wisconsin 13.8 36 33 39 84 670.31 106.62
50 Wyoming 17.4 42 32 81 90 791.14 122.04

Graphical summary and correlation plots¶

In this section, the report aims to gain a deeper understanding of the dataset by creating textual and graphical summaries of the data. It begins by computing key summary statistics, including measures of central tendency and dispersion, to get a sense of the distribution of the variables. Additionally, it produces histograms for each column to provide a visual representation of the distributions. To further explore the relationships between variables, it also creates a pairwise scatter-plot matrix and a correlation table, which together give a visual representation of the relationships between all columns in the dataset. This will help identify patterns and relationships in the data and inform the subsequent analysis.
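The per-column histograms described above can be sketched as below. This is a minimal sketch on made-up stand-in values, since the real `car_acc` DataFrame is only loaded in the cells that follow; in the report itself the same call would be `car_acc.hist(...)`.

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical stand-in for a few renamed `car_acc` columns (values are made up)
demo = pd.DataFrame({
    'Fatality_rate_Drivers': [18.8, 18.1, 12.0, 13.6, 10.8, 16.2],
    'Speeders': [39, 41, 35, 37, 46, 38],
    'Alcohol-Impaired': [30, 25, 28, 28, 36, 30],
})

# `DataFrame.hist` draws one histogram per numeric column in a grid
axes = demo.hist(figsize=(9, 6), bins=5)
plt.tight_layout()
```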

In [3]:
# import seaborn and make plots appear inline
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Compute the summary statistics of all columns in the `car_acc` DataFrame
sum_stat_car = car_acc.describe()
print(sum_stat_car)

# Compute the correlation coefficient for all numeric column pairs
sns.set_palette("husl")
corr_columns = car_acc.corr(numeric_only=True)
corr_columns
       Fatality_rate_Drivers   Speeders  Alcohol-Impaired  Not_Distracted  \
count              51.000000  51.000000         51.000000       51.000000   
mean               15.790196  31.725490         30.686275       85.921569   
std                 4.122002   9.633438          5.132213       15.158949   
min                 5.900000  13.000000         16.000000       10.000000   
25%                12.750000  23.000000         28.000000       83.000000   
50%                15.600000  34.000000         30.000000       88.000000   
75%                18.500000  38.000000         33.000000       95.000000   
max                23.900000  54.000000         44.000000      100.000000   

       No_Previous_Accidents     Premiums  Losses_Per_Driver  
count               51.00000    51.000000          51.000000  
mean                88.72549   886.957647         134.493137  
std                  6.96011   178.296285          24.835922  
min                 76.00000   641.960000          82.750000  
25%                 83.50000   768.430000         114.645000  
50%                 88.00000   858.970000         136.050000  
75%                 95.00000  1007.945000         151.870000  
max                100.00000  1301.520000         194.780000  
Out[3]:
Fatality_rate_Drivers Speeders Alcohol-Impaired Not_Distracted No_Previous_Accidents Premiums Losses_Per_Driver
Fatality_rate_Drivers 1.000000 -0.029080 0.199426 0.009782 -0.017942 -0.199702 -0.036011
Speeders -0.029080 1.000000 0.286244 0.131738 0.014066 0.042541 -0.061241
Alcohol-Impaired 0.199426 0.286244 1.000000 0.043380 -0.245455 -0.017451 -0.083916
Not_Distracted 0.009782 0.131738 0.043380 1.000000 -0.195265 0.019578 -0.058467
No_Previous_Accidents -0.017942 0.014066 -0.245455 -0.195265 1.000000 0.075533 0.042770
Premiums -0.199702 0.042541 -0.017451 0.019578 0.075533 1.000000 0.623116
Losses_Per_Driver -0.036011 -0.061241 -0.083916 -0.058467 0.042770 0.623116 1.000000
In [4]:
# Create a PairGrid object

sns.set_palette("husl")
sns.pairplot(car_acc)
Out[4]:
<seaborn.axisgrid.PairGrid at 0x7f93f99ed430>

Fitting a Regression model¶

Having analyzed the correlation among the various features and the outcome of interest, namely the number of drivers involved in fatal collisions, the report finds that the Percentage of Drivers Involved In Fatal Collisions Who Were Alcohol-Impaired (Alcohol-Impaired) has the strongest correlation. However, it is also evident that some features are correlated with one another, such as Speeders and Alcohol-Impaired.

To gain a more comprehensive understanding of the relationship between the features and the outcome, the study employs multivariate linear regression. This method calculates the association of each feature with the outcome, accounting for the effects of the other features. The study also notes that the results from multivariate regression and correlation analysis may differ, as the former adjusts for the impact of all other features, whereas the latter simply measures the strength of association between each feature and the outcome.

In some cases, the correlation coefficient and regression coefficient for the same feature may have opposite signs. This can occur when a feature is positively correlated with the outcome but also positively correlated with another feature that negatively impacts the outcome. In such instances, the indirect correlation between the two features can dominate the direct correlation with the outcome, resulting in a positive regression coefficient despite a negative correlation coefficient (or vice versa). This phenomenon is known as a masking relationship, and multivariate regression will help uncover such relationships.

In [5]:
import statsmodels.api as sm

# Create the features and target DataFrames
features = car_acc[['Speeders','Alcohol-Impaired','Not_Distracted',
                    'No_Previous_Accidents']]
target = car_acc['Fatality_rate_Drivers']

# Create a design matrix with an intercept
X = sm.add_constant(features)

# Fit the linear regression model to the design matrix
model = sm.OLS(target, X).fit()

# Get the summary statistics
summary = model.summary()

# Print the summary statistics
print(summary)
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     Fatality_rate_Drivers   R-squared:                       0.050
Model:                               OLS   Adj. R-squared:                 -0.033
Method:                    Least Squares   F-statistic:                    0.6044
Date:                   Fri, 03 Feb 2023   Prob (F-statistic):              0.661
Time:                           20:03:18   Log-Likelihood:                -142.79
No. Observations:                     51   AIC:                             295.6
Df Residuals:                         46   BIC:                             305.2
Df Model:                              4                                         
Covariance Type:               nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                     8.3332     10.638      0.783      0.437     -13.081      29.747
Speeders                 -0.0432      0.065     -0.663      0.511      -0.174       0.088
Alcohol-Impaired          0.1918      0.125      1.535      0.132      -0.060       0.443
Not_Distracted            0.0059      0.040      0.147      0.884      -0.075       0.087
No_Previous_Accidents     0.0274      0.090      0.305      0.762      -0.154       0.209
==============================================================================
Omnibus:                        0.229   Durbin-Watson:                   1.917
Prob(Omnibus):                  0.892   Jarque-Bera (JB):                0.415
Skew:                          -0.103   Prob(JB):                        0.813
Kurtosis:                       2.609   Cond. No.                     2.39e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.39e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

The regression results above show that Alcohol-Impaired has only a weak association with the number of drivers involved in fatal collisions per billion miles, although it is the strongest among the features considered (and, given the high p-values, none of the coefficients is statistically significant). It is also crucial to note the association of alcohol impairment with the other features in the dataset. Standard statistical practice therefore suggests clustering the states in a manner that takes all of these features into account.

Perform PCA on standardized data¶

For the purpose of clustering the data, the study uses principal component analysis (PCA). PCA provides a visual representation of the data in a reduced-dimensional space, which can help reveal patterns and relationships. It is also important to standardize the features prior to PCA to ensure that they are on a similar scale and that the variance explained by each principal component accurately reflects the relationships between the features. The study scales the features to have a mean of 0 and a standard deviation of 1, which is a common method for standardizing data.
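Concretely, the standardization used here is equivalent to subtracting each column's mean and dividing by its population standard deviation. A small check on made-up numbers (not the report's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative 4x2 matrix standing in for two feature columns (values are made up)
X = np.array([[39.0, 30.0],
              [41.0, 25.0],
              [35.0, 28.0],
              [18.0, 26.0]])

# StandardScaler centres each column and divides by the population std (ddof=0)
scaled = StandardScaler().fit_transform(X)

# The same transform written out by hand
manual = (X - X.mean(axis=0)) / X.std(axis=0)
```

Each column of `scaled` then has mean 0 and standard deviation 1, and matches `manual` exactly.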

In [6]:
# Standardize and center the feature columns
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Import the PCA class from sklearn
from sklearn.decomposition import PCA
pca = PCA()

# Fit the standardized data to the pca
pca.fit(features_scaled)

# Plot the proportion of variance explained on the y-axis of the bar plot
import matplotlib.pyplot as plt
plt.bar(range(1, pca.n_components_ + 1),  pca.explained_variance_ratio_)
plt.xlabel('Index of Principal component')
plt.ylabel('Proportion of variance explained')
plt.xticks([1, 2, 3, 4])

# Compute the cumulative proportion of variance explained by the first three principal components
three_first_comp_var_exp = pca.explained_variance_ratio_.cumsum()[2]
print("The cumulative variance of the first three principal components is {}".format(
    round(three_first_comp_var_exp, 5)))
The cumulative variance of the first three principal components is 0.85758

Visualization¶

The utilization of the first three principal components allows for the representation of the data in a three-dimensional space while preserving a significant amount of variation (~ 86%) from the original four features, namely Speeders, Alcohol-Impaired, Not_Distracted and No_Previous_Accidents. This approach allows for the use of human visual pattern recognition in the identification of groups of similar states within the data.

To facilitate this analysis, the study uses a scatter plot, which is created by utilizing the first three principal components, enabling one to examine the clustering of states in this visual representation.

In [7]:
# Transform the scaled features using three principal components
pca = PCA(n_components = 3)
p_comps = pca.fit_transform(features_scaled)

# Extract the first and second component to use for the scatter plot
p_comp1 = p_comps[:, 0]
p_comp2 = p_comps[:, 1]
p_comp3 = p_comps[:, 2]
In [8]:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(p_comp1, p_comp2, p_comp3)

plt.show()

Plotting the above diagram interactively for better visualisation.¶

In [9]:
import plotly.express as px

fig = px.scatter_3d(x=p_comp1, y=p_comp2, z=p_comp3, 
                    opacity=1, width=800, height=500)
fig.show()

Find clusters of similar states in the data¶

With the goal of uncovering clusters of similar states within the data, the report employs KMeans clustering. The scatter plot generated from principal component analysis (PCA) may not provide sufficient insight into the number of groups into which the states cluster. To determine an appropriate number of clusters more effectively, the study plots the silhouette score for a range of candidate cluster counts. The optimal number of clusters is the one corresponding to the highest silhouette score.

In [10]:
# Import the necessary libraries
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

# Initialize an empty list to store the silhouette scores
silhouette_scores = []

# Loop over a range of values for n_clusters
for n_clusters in range(2, 11):
    # Initialize the KMeans model with the current number of clusters
    kmeans = KMeans(n_clusters=n_clusters, random_state=8)
    # Fit the model to the scaled features
    kmeans.fit(features_scaled)
    # Calculate the silhouette score for the current number of clusters
    silhouette_score_ = silhouette_score(features_scaled, kmeans.labels_)
    # Append the silhouette score to the list of silhouette scores
    silhouette_scores.append(silhouette_score_)

# Plot the silhouette scores
plt.plot(range(2, 11), silhouette_scores, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.show()

In the above plot of the number of clusters against the corresponding silhouette score, one can see that the optimal number of clusters is four.
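Picking the best number of clusters from the scores can also be done programmatically with an argmax over the candidate range. A self-contained sketch on synthetic blob data (standing in for the scaled features, not the report's dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with four well-separated groups
X, _ = make_blobs(n_samples=120, centers=4, cluster_std=0.6, random_state=8)

# Score each candidate number of clusters and keep the highest-scoring one
ks = list(range(2, 11))
scores = [silhouette_score(X, KMeans(n_clusters=k, random_state=8, n_init=10).fit_predict(X))
          for k in ks]
best_k = ks[int(np.argmax(scores))]
```

On this toy data the silhouette score peaks at the true number of groups.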

KMeans to visualize clusters¶

The silhouette analysis above points to four clusters, so the analysis proceeds with four. Let's see how the PCA scatter plot looks if we color the states according to the cluster to which they are assigned.

In [11]:
import plotly.express as px

# Create a KMeans object with 4 clusters
km = KMeans(n_clusters=4, random_state=8)
# Fit the data to the `km` object
km.fit(features_scaled)

fig = px.scatter_3d(x=p_comps[:, 0], y=p_comps[:, 1], z=p_comps[:, 2], 
                    color=km.labels_,
                    color_discrete_sequence=px.colors.sequential.Plasma,
                    opacity=0.8,
                    width=800, height=500)
fig.show()

Visualize the feature differences among the clusters¶

So far, the study has deployed visual inspection of the data and the KMeans clustering algorithm to uncover patterns in the information. To further illustrate the significance of these patterns, it is crucial to recall that the information used to cluster the states into four separate groups comprised the percentages of drivers who were speeding, who were alcohol-impaired, who were not distracted, and who had not been involved in any previous accidents. These clusters were used to visualize how the states group when considering the first three principal components, providing a better understanding of the structure in the data. However, this visualization may not be easily comprehended by those without a specialist background.

Therefore, as a logical continuation of the analysis, it would be beneficial to examine how these four clusters differ in terms of the four features used for clustering. To achieve this, the study reverts to using the unscaled features, which will aid in the interpretation of the differences.

In [12]:
# Create a new column with the labels from the KMeans clustering
car_acc['cluster'] = km.labels_

# Reshape the DataFrame to the long format
melt_car = pd.melt(car_acc, id_vars='cluster', var_name='measurement', value_name='percent',
                   value_vars=['Speeders', 'Alcohol-Impaired', 'Not_Distracted','No_Previous_Accidents'])

# Create a box plot splitting and coloring the results according to the km-clusters
sns.boxplot(y='measurement', x='percent', data=melt_car, hue='cluster')
Out[12]:
<AxesSubplot: xlabel='percent', ylabel='measurement'>

Compute the number of accidents within each cluster¶

Having identified patterns in the data through both visual interpretation and KMeans clustering, it is now necessary to delve deeper into these patterns to understand their meaning. In order to determine the most effective course of action, it is important to consider the extent to which each state is affected by factors such as the percentage of drivers who are speeding, under the influence of alcohol, or have prior accidents.

To gain a clearer understanding of the data, the unscaled features are used rather than the scaled features. The next step in the analysis involves exploring the differences among the four clusters in terms of the four features used for clustering.

It is important to allocate resources and time effectively, and so it is necessary to prioritize the most pressing issues. To accomplish this, data on the number of miles driven in each state is obtained, as it will provide insight into the total number of fatal accidents in each state. This information is incorporated into the DataFrame and a box plot is created to visualize the number of total fatal traffic accidents within each state cluster.

In [13]:
# Read in `miles_driven.csv`
miles_driven = pd.read_csv('miles_driven.csv', sep=',')
In [14]:
# Merge the `car_acc` DataFrame with the `miles_driven` DataFrame
car_acc_miles = car_acc.merge(miles_driven, on='state')
In [15]:
#Create a new column for the number of drivers involved in fatal accidents
car_acc_miles['num_drvr_fatl'] = car_acc_miles['Fatality_rate_Drivers'] * car_acc_miles['million_miles_annually'] / 1000

# Create a barplot of the total number of accidents per cluster
# (`errorbar=None` replaces the deprecated `ci=None`)
sns.barplot(x='cluster', y='num_drvr_fatl', data=car_acc_miles, estimator=sum, errorbar=None)

# Calculate the number of states in each cluster and their 'num_drvr_fatl' mean and sum
count_mean_sum = car_acc_miles.groupby('cluster')['num_drvr_fatl'].agg(['count', 'mean', 'sum'])
count_mean_sum
Out[15]:
count mean sum
cluster
0 17 749.668082 12744.3574
1 13 1060.110408 13781.4353
2 18 969.818439 17456.7319
3 1 683.777600 683.7776

The findings indicate that no single cluster stands out unambiguously as the focus for policy intervention. However, cluster 2 accounts for both the most states (18) and the largest total number of drivers involved in fatal collisions (about 17,457), so it can be argued that this cluster should be given particular attention. Further investigation into the characteristics and circumstances surrounding this cluster may provide valuable insight into potential solutions for reducing the number of fatal traffic accidents.